Financial and risk management in applied statistics with R

Introduction

The introduction section should relay what you are attempting to accomplish. It would include a statement of the business, science, research, or personal interest you have that leads to analyzing the data you’ve chosen. It should provide enough background to your work such that a reader would not need to load your data to understand your report. Like the simulation project, you can assume the reader is familiar with the course concepts, but not your data. Some things to consider:

What is this data? Where did it come from? What are the variables? Why is it interesting to you? Why are you creating a model for this data? What is the goal of this model?

Methods

The methods section should contain the bulk of your “work.” This section will contain the bulk of the R code that is used to generate the results. Your R code is not expected to be perfect idiomatic R, but it is expected to be understood by a reader without too much effort. Use RMarkdown and code comments to your advantage to explain your code if needed.

This section should contain any information about data preparation that is performed to the original data before modelling. Then you will apply methods seen in class, which may include some of the following but are not limited to:

Multiple linear regression Dummy variables Interaction Residual diagnostics Outlier diagnostics Transformations Polynomial regression Model selection Your task is not to use as many methods as possible. Your task is to use appropriate methods to find a good model that can correctly answer a question about the dataset, and then to communicate your result effectively. Some possible items to be discussed:

Description of the original data file including description of all relevant variables. Description of additional data preparation that you performed. Description of the process you chose to follow. Narrative of your step-by-step decision making process throughout the analysis as you adjusted the model and attempted to validate model assumptions.

# install necessary packages
# install.packages("readxl")
# install.packages("sqldf")
# install.packages("corrplot")

Loading necessary Libraries

library(readr)
library("readxl")
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Could not load tcltk.  Will use slower R code instead.
## Loading required package: RSQLite
library(glue)
library("corrplot")
## corrplot 0.84 loaded
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(leaps)
library(MASS)
options(digits=6)
options(warn=-1)

data clean

# price of bitcoin
Bitcoin <- read_csv("Bitcoin Historical Data - Investing.com.csv")
## Parsed with column specification:
## cols(
##   Date = col_character(),
##   Price = col_double(),
##   Open = col_double(),
##   High = col_double(),
##   Low = col_double(),
##   Vol = col_character(),
##   Change = col_character(),
##   left = col_double(),
##   right = col_character(),
##   Volume = col_double()
## )
Bitcoin$Date = as.Date(Bitcoin$Date, format="%b %d, %Y")
Bitcoin$month <- strftime(Bitcoin$Date, "%m")
Bitcoin$year <- strftime(Bitcoin$Date, "%Y")
plot(Price ~ Date, data=Bitcoin, type = "l")

plot(Volume ~ Date, data=Bitcoin, type = "l")

# Incumbent president 
President <- read_csv("President.csv")
## Parsed with column specification:
## cols(
##   Date = col_character(),
##   president_name = col_character(),
##   president_code = col_double()
## )
President$Date = as.Date(President$Date, format="%b %d, %Y")
plot(president_code ~ Date, data=President)

The president code is a categorical variable that classifies the United State’s incumbent president across 2010 and 2020 into “0” (Barack Obama) and “1” (Donald Trump). We are interested in examining the impacts of these two presidents’ respective trade and monetary policy on the financial market. Particularly, Trump’s infamous “covfefe” tweets have been pushing markets around. We would like to investigate how this affects the traders of Bitcoin as well.

# Interest rate
The_interest_rate_for_the_United_States<- read_csv("The interest rate for the United States, from 2010 - 2020.csv")
## Parsed with column specification:
## cols(
##   Date = col_date(format = ""),
##   interest_rate = col_double()
## )
The_interest_rate_for_the_United_States$month <- strftime(The_interest_rate_for_the_United_States$Date, "%m")
The_interest_rate_for_the_United_States$year <- strftime(The_interest_rate_for_the_United_States$Date, "%Y")
plot(interest_rate ~ Date, data=The_interest_rate_for_the_United_States, type = "o")

Interest rate is considered as a crucial factor influencing investors’ willingness to invest. The longer interest rates stay low, the more investors tend to deploy leverage. This March, the price of Bitcoin rallied to $5,940 as the United States Federal Reserve cut interest rates to 0.

# Exchange rates
US_dollar_yuan_exchange_rate <- read_csv("US dollar-yuan exchange rate.csv")
## Parsed with column specification:
## cols(
##   Date = col_character(),
##   exchange_rate = col_double()
## )
US_dollar_yuan_exchange_rate$Date = as.Date(US_dollar_yuan_exchange_rate$Date, format="%m/%d/%y")
plot(exchange_rate ~ Date, data = US_dollar_yuan_exchange_rate, type = 'l')

Euro_to_US_exchange_rate <- read_csv("euro-dollar-exchange-rate-historical-chart.csv")
## Parsed with column specification:
## cols(
##   Date = col_date(format = ""),
##   exchange_rate = col_double(),
##   X3 = col_double()
## )
Euro_to_US_exchange_rate$Date = as.Date(Euro_to_US_exchange_rate$Date, format="%m/%d/%y")
Euro_to_US_exchange_rate$exchange_rate = 1/Euro_to_US_exchange_rate$exchange_rate
plot(exchange_rate  ~ Date, data = Euro_to_US_exchange_rate, type = 'l')

Exchange rates also play a relevant role in the financial market as they affect foregin investors’ sentiment and the transaction costs. For example, China has introduced several capital controls with regards to Bitcoin speculation. Given that more than 90% ofthe bitcoins are traded with the yuan, and more than 70% of the mining takes places in China, we believe the exchange rate of the US dollar with the yuan is an influential factor in determining Bitcoin prices.

# Gold price
Gold <- read_csv("Gold Futures Historical Data.csv")
## Parsed with column specification:
## cols(
##   Date = col_character(),
##   Price = col_double(),
##   Open = col_double(),
##   High = col_double(),
##   Low = col_double(),
##   Vol. = col_character(),
##   `Change %` = col_character()
## )
Gold$Date = as.Date(Gold$Date, format="%b %d, %Y")
plot(Price ~ Date, data=Gold, type='l')

Gold has been a hedge and safe haven for investors. Since Bitcoins usually react positively to great financial movements, we would like to study bitcoin’s relationship with precious metals price in a dynamic environment and investigate whether bitcoins can be a hedge or safe haven asset under market uncertainty scenarios.

This dataset we will use the Volume and Price columns.

# Stock market indices
Dow_Jones_Industrial_Average <- read_csv("Dow Jones Industrial Average Historical Data.csv")
## Parsed with column specification:
## cols(
##   Date = col_character(),
##   Price = col_number(),
##   Open = col_number(),
##   High = col_number(),
##   Low = col_number(),
##   Vol. = col_character(),
##   `Change %` = col_character()
## )
Dow_Jones_Industrial_Average$Date = as.Date(Dow_Jones_Industrial_Average$Date, format="%b %d, %Y")
plot(Price ~ Date, data=Dow_Jones_Industrial_Average, type = 'l')

NYSE_Composite_Index <- read_csv("NYSE Composite Historical Data.csv")
## Parsed with column specification:
## cols(
##   Date = col_character(),
##   Price = col_number(),
##   Open = col_number(),
##   High = col_number(),
##   Low = col_number(),
##   Vol. = col_character(),
##   `Change %` = col_character()
## )
NYSE_Composite_Index$Date = as.Date(NYSE_Composite_Index$Date, format="%b %d, %Y")
plot(Price ~ Date, data=NYSE_Composite_Index, type = 'l')

Dow Jones Industrial Average and NYSE Composite Index are major U.S. stock indices. We want to investigate the relationship between Bitcoin and these stock market barometers and examine whether they behave differently under market turmoil.

This dataset we will use the Adj Close and volume columns.

# Oil prices
WTI_oil_price <- read_csv("Crude Oil WTI Futures Historical Data.csv")
## Parsed with column specification:
## cols(
##   Date = col_character(),
##   Price = col_double(),
##   Open = col_double(),
##   High = col_double(),
##   Low = col_double(),
##   Vol. = col_character(),
##   `Change %` = col_character()
## )
WTI_oil_price$Date = as.Date(WTI_oil_price$Date, format="%b %d, %Y")
# colnames(WTI_oil_price)[2] <- "WTI_oil_price"
plot(Price ~ Date, data=WTI_oil_price, type = 'l')

Brent_Oil_price <- read_csv("Brent Oil Futures Historical Data.csv")
## Parsed with column specification:
## cols(
##   Date = col_character(),
##   Price = col_double(),
##   Open = col_double(),
##   High = col_double(),
##   Low = col_double(),
##   Vol. = col_character(),
##   `Change %` = col_character()
## )
Brent_Oil_price$Date = as.Date(Brent_Oil_price$Date, format="%b %d, %Y")
# colnames(Brent_Oil_price)[2] <- "Brent_oil_price"
plot(Price ~ Date, data=Brent_Oil_price, type = 'l')

Oil price plays an important role in the economy due to increasing global demand for oil to support industrialization and urbanization. Higher oil prices tend to make production more expensive for businesses and therefore raise commodity prices. Oil price increases are also generally thought to increase inflation and reduce economic growth. We want to see how Bitcoin is related to this market influencer.

Data Cleaning and preparing

Combine all those csv to a one giant dataframe

Because we are using gold price, stock price, oil price, etc to predict Bitcoin price, it’s reasonable to expect a few days delay for the Bitcoin price to reflect what happen in stock and futures markets. For this model, we set the reponse time for Bitcoin price to 15 days, which assumes that Bitcoin price will react about 15 days later than the gold price, stock price, oil price, etc changed.

sql =  glue("
select Bitcoin.Date,
       Bitcoin.Price                as Bitcoin_price,
       Bitcoin.Volume               as Transaction_volume,
       exchange_China.exchange_rate as USD_Yuan_exchange_rate,
       exchange_Euro.exchange_rate  as USD_Euro_Exchange_Rate,
       DJIA.Price                   as DJIA_index,
       NYSE.Price                   as NYSE_index,
       Brent_Oil_price.Price        as brent_oil_price,
       WTI_oil_price.Price          as WTI_oil_price,
       Gold.Price                   as Gold_price,
       President.president_code,
       interest.interest_rate,
       Bitcoin.month,
       Bitcoin.year
from Bitcoin
         left outer join US_dollar_yuan_exchange_rate as exchange_China
                         on exchange_China.Date = Bitcoin.Date - 15
         left outer join Euro_to_US_exchange_rate as exchange_Euro
                         on exchange_Euro.Date = Bitcoin.Date - 15
         left outer join Dow_Jones_Industrial_Average as DJIA
                         on DJIA.Date = Bitcoin.Date - 15
         left outer join NYSE_Composite_Index as NYSE
                         on NYSE.Date = Bitcoin.Date - 15
         left outer join Brent_Oil_price
                         on Brent_Oil_price.Date = Bitcoin.Date - 15
         left outer join WTI_oil_price
                         on WTI_oil_price.Date = Bitcoin.Date - 15
         left outer join President
                         on President.Date = Bitcoin.Date - 15
         left outer join Gold
                         on Gold.Date = Bitcoin.Date - 15
         left outer join The_interest_rate_for_the_United_States as interest
                         on interest.month = Bitcoin.month
                             and interest.year = Bitcoin.year
where Bitcoin.Date between {c (1.0, as.Date('2014-06-01'))[2]} and {c (1.0, as.Date('2020-04-30'))[2]}
order by Bitcoin.Date asc")
combined_daily = sqldf(sql)
combined_daily$Date_posix = c(1, as.Date(combined_daily$Date))[-c(1)]
combined_daily$president_code = as.factor(combined_daily$president_code)
combined_weekdays = na.omit(combined_daily)
monthly_volume_df = sqldf("
select month, year, sum(combined_daily.Transaction_volume) as Transaction_volume
       from combined_daily
       group by combined_daily.month, combined_daily.year")

combined_monthly = sqldf("
select Date,
       Bitcoin_price,
       monthly_volume_df.Transaction_volume,
       USD_Yuan_exchange_rate,
       USD_Euro_Exchange_Rate,
       DJIA_index,
       NYSE_index,
       brent_oil_price,
       WTI_oil_price,
       Gold_price,
       president_code,
       interest_rate,
       Date_posix,
       combined_weekdays.year
from combined_weekdays
join monthly_volume_df
on monthly_volume_df.month = combined_weekdays.month
and monthly_volume_df.year = combined_weekdays.year
where combined_weekdays.Date in
      (select max(combined_weekdays.Date)
       from combined_weekdays
       group by combined_weekdays.month, combined_weekdays.year)
")

combined_weekdays = combined_weekdays[!(names(combined_weekdays) %in% c("month"))]
# combined_monthly
# plot(combined_monthly$Date, combined_monthly$Bitcoin_price, type = "l", col = "blue")

Testing

Bitcoin did not become popularly traded until 2014, therefore we will use all data from 2014 to 2020 because Bitcoin does not really count as a financial product before 2014. Data before 2014 are ignored.
In order to test the effectiveness of our model, we randomly select half of the observations as our training dataset and the other half as our testing datasets.

data = combined_weekdays
trn_idx = sample(nrow(data), nrow(data) / 2)
monthly_train = combined_monthly[trn_idx, ]
monthly_test = combined_monthly[-trn_idx, ]
weekday_train = combined_weekdays[trn_idx, ]
weekday_test = combined_weekdays[-trn_idx, ]
train_data = weekday_train
# train_data = monthly_train

Helper functions

model_names = c()
model_params_num = c()
model_rss = c()
model_adjr2 = c()
model_rmse = c()
test_rmse = c()

store_result = function(model, model_name, log = FALSE){
  model = model
  rmses = calc_rmse(model, data, trn_idx, log)
  model_names <<- c(model_names, model_name)
  model_params_num <<- c(model_params_num, length(coefficients(model)))
  model_rss <<- c(model_rss, deviance(model))
  model_adjr2 <<- c(model_adjr2, summary(model)$adj.r)
  model_rmse <<- c(model_rmse, rmses[1])
  test_rmse <<- c(test_rmse, rmses[2])
}

calc_rmse = function(model, data, trn_idx, log = FALSE) {
  data_train = data[trn_idx, ]
  data_test = data[-trn_idx, ]
  pred = predict(model, data_train)
  if(log){
    pred = exp(pred)
  }
 
  actual = data_train$Bitcoin_price
  train_rmse = sqrt(mean((pred - actual)^2))

  pred = predict(model, data_test)
  if(log){
    pred = exp(pred)
  }
  actual = data_test$Bitcoin_price
  test_rmse = sqrt(mean((pred - actual)^2))
  c(train_rmse, test_rmse)
}

plots_model = function(model){
  par(mfrow = c(1, 2))
  plot(fitted(model), resid(model), col = "grey", pch = 20,
  xlab = "Fitted", ylab = "Residuals", main = "Fitted vs Residuals")
  abline(h = 0, col = "darkorange", lwd = 2)
  qqnorm(resid(model), col = "darkgrey")
  qqline(resid(model), col = "dodgerblue", lwd = 2)
}

plot_prediction = function(model, data, trn_idx, log=FALSE) {
  par(mfrow = c(1, 1))
  data_test = data[-trn_idx, ]
  pred = predict(model, data)
  if(log){
    pred = exp(pred)
  }

  plot(data$Date, data$Bitcoin_price, type = "l", col = "blue")
  points(data$Date, pred, type = "l", col = "green")
  title(main = deparse(substitute(model)))
  legend("topleft", 
         legend=c("Actually", "Predicted"), 
         col=c("blue", "green"), lty=1:1, cex=0.8, 
         text.font=4, bg='antiquewhite')
}

Investigating the Predictors

Overview of the Dataset

Our potential predictors of Bitcoin price are

  • Transaction Volume
  • US Chinese Yuan exchange rate
  • US European Dollar exhcange rate
  • NYSE Composite Index
  • Dow Jones Industrial Average
  • Brent Oil futures
  • WTI Oil futures
  • Gold futures
  • President terms (Dummy Variable)
  • Federal Interest rates
  • Date time (Every 210,000 blocks mined, or about every four years, the reward given to Bitcoin miners for processing transactions is cut in half) Therefore, theoretically Bitcoin will increase its value as time passes by.
  • Year (Categorical variable) World situation may change every year. Bitcoin price is also affected by black swan events, or pandemic like what we are having now.

Pairs

We’ll use the pairs() plot to determine which variables may benefit from a quadratic or logarithm relationship with the response.

pairs(data[!(names(data) %in% c("Date", "month","year"))], col = "dodgerblue")

### Transformation Next, we will use the boxcox() function to find the best transformation of the form considered by the Box-Cox method.

model = lm(Bitcoin_price ~ . ,data=combined_monthly)
boxcox(model,
       lambda = seq(-0.5, 0.5, by = 0.005),
       plotit = TRUE)

we see a very high log likelihood when λ is from about -0.3 to 0.1, which recommends a transformation of log transformation.

Interaction

In additional to pair plot, we use a correlation matrix to find all possible interaction between each of our predictors.

# find correlation between all of these predictors
log_transform = combined_monthly
log_transform$Transaction_volume = log(log_transform$Transaction_volume)
M <- cor(x=log_transform[!(names(log_transform) %in% c("Date", "month","year", "president_code"))], use = "everything", method="pearson") # get correlations
# M
library('corrplot') #package corrplot
corrplot(M, method = "circle")

It seems that Transaction volume, DJIA_index, NYSE_index and president term, interest rate and Date time has a positive relationship with Bitcoin price. We are going to pay more attention to these predictors.
Meanwhile, NYSE Composite Index and Dow Jones Industrial Average are high correlated to each other. SO are the Brent oil futures and WTI oil futures. These intercorrelation might be harmful to our model.

Results

The results section should contain numerical or graphical summaries of your results. You should report a final model you have chosen. There is not necessarily one, singular correct model, but certainly some methods and models are better than others in certain situations. You may use any methods we studied this semester to complete this task, and provide evidence that your final choice of model is a good one. Some possible items to be discussed:

Simplest model, Multiple Regression

Let’s first use a full_mlr_model as a starting point. It is not a good model, but we will try to improve this model using parameter transformation, variable selection and model building techniques

full_mlr_model = lm(Bitcoin_price ~ ., train_data[!(names(data) %in% c("Date", "month"))])
plots_model(full_mlr_model)

store_result(full_mlr_model, "full_mlr_model")

As shown by the above two graph, the variance is not constant, the linearly is also violated. And according to Q-Q plot, Normality of error is violated, too.
We need to find better models.

Polynomial regression

Therefore, we will re-select our predictors using regsubsets().
Because there are exetremely values and peaks in Bitcoin price, We will consider adding quadratic terms to better captures the Bitcoin_price trends.
Since the number of parameters is not too many, we will use a comprehensive search to find the most significant models by calling regsubsets.

library(leaps)
exhaustive_search = summary(regsubsets(Bitcoin_price ~ Transaction_volume + USD_Yuan_exchange_rate + USD_Euro_Exchange_Rate + DJIA_index + NYSE_index + brent_oil_price + WTI_oil_price + Gold_price + president_code + interest_rate + Date_posix + I(Transaction_volume^2) + I(Date_posix ^ 2) + I(USD_Yuan_exchange_rate ^ 2) + I(USD_Euro_Exchange_Rate ^ 2) + I(DJIA_index ^ 2)  + I(NYSE_index ^ 2) + I(brent_oil_price ^ 2) + I(WTI_oil_price ^ 2) + I(Gold_price ^ 2) + I(interest_rate ^ 2), data = data[!(names(data) %in% c("Date", "month"))], nvmax=8))
# exhaustive_search$which
p = length(coef(exhaustive_search))
n = length(resid(exhaustive_search))

We will limit up to 9 of the most significant predictors in our model. If we look at the RSS and Adjusted R-squared, we will also find that RSS and Adjusted R-sqaured are getting better as we increase our number of predictors until nine.

exhaustive_search$rss
## [1] 5953716303 5406212252 4862921126 4720408716 4594155600 4393951274 4326940837
## [8] 4240538876
exhaustive_search$adjr2
## [1] 0.734974 0.759179 0.783231 0.789438 0.794928 0.803728 0.806588 0.810318
(best_r2_ind = which.max(exhaustive_search$adjr2))
## [1] 8
exhaustive_search$which[best_r2_ind, ]
##                 (Intercept)          Transaction_volume 
##                        TRUE                        TRUE 
##      USD_Yuan_exchange_rate      USD_Euro_Exchange_Rate 
##                        TRUE                       FALSE 
##                  DJIA_index                  NYSE_index 
##                        TRUE                        TRUE 
##             brent_oil_price               WTI_oil_price 
##                       FALSE                       FALSE 
##                  Gold_price             president_code1 
##                       FALSE                       FALSE 
##               interest_rate                  Date_posix 
##                       FALSE                       FALSE 
##     I(Transaction_volume^2)             I(Date_posix^2) 
##                       FALSE                       FALSE 
## I(USD_Yuan_exchange_rate^2) I(USD_Euro_Exchange_Rate^2) 
##                       FALSE                        TRUE 
##             I(DJIA_index^2)             I(NYSE_index^2) 
##                        TRUE                        TRUE 
##        I(brent_oil_price^2)          I(WTI_oil_price^2) 
##                       FALSE                       FALSE 
##             I(Gold_price^2)          I(interest_rate^2) 
##                       FALSE                        TRUE

Based on the results of comprehensive search, we have an exhaustive_search_model model.

exhaustive_search_model = lm(Bitcoin_price ~ Transaction_volume + DJIA_index + NYSE_index + Date_posix + I(Date_posix^2) + I(USD_Yuan_exchange_rate^2) + I(USD_Euro_Exchange_Rate^2) + I(DJIA_index^2) + I(NYSE_index^2),  data = train_data)
plots_model(exhaustive_search_model)

store_result(exhaustive_search_model, "exhaustive_search_model")

Unfortunately, linearity and constant variance is still violated. Therefore, we will using logarithm as a variance stabilizing transformation

Logarithmic transformation and Interaction terms

log_model = lm(log(Bitcoin_price) ~ . ,  data = train_data, log = TRUE)
store_result(exhaustive_search_model, "log_model")

log_full_interaction_model = lm(log(Bitcoin_price) ~ (. ^ 2) ,  data = train_data, log = TRUE)
plots_model(log_full_interaction_model)

store_result(exhaustive_search_model, "log_full_interaction_model")

After doing log transformation and adding interaction, The mean of the residuals is around not zero therefore linearity is perserved.
Equal Variance also seems to be preserved because the the variance looks constant. From the Q-Q plot, the distribution of looks much better than previous models. But the normality of error is still violated at extreme values.

plot_prediction(log_full_interaction_model, data, trn_idx, log = TRUE)

Stepwise model selection

Since we just add all interaction terms into our model, it looks like we have a overfitting when around year 2017. Besides, we would like to reduce the number of predictors to simplify our model.

n = length(resid(log_full_interaction_model))
log_full_int_mod_back_bic = step(log_full_interaction_model, direction = "backward", k = log(n), trace = 0)
log_full_int_mod_back_aic = step(log_full_interaction_model, direction = "backward", step=0)
## Start:  AIC=-3561.33
## log(Bitcoin_price) ~ ((Date + Transaction_volume + USD_Yuan_exchange_rate + 
##     USD_Euro_Exchange_Rate + DJIA_index + NYSE_index + brent_oil_price + 
##     WTI_oil_price + Gold_price + president_code + interest_rate + 
##     year + Date_posix)^2)
log_interaction_only_year = lm(log(Bitcoin_price) ~ (. * year) ,  data = train_data)
n = length(resid(log_interaction_only_year))

log_groupby_year_back_bic = step(log_interaction_only_year, direction = "backward", k = log(n), trace = 0)
store_result(log_groupby_year_back_bic, "log_groupby_year_back_bic", log = TRUE)
log_groupby_year_back_aic = step(log_interaction_only_year, direction = "backward", step=0)
## Start:  AIC=-3208.45
## log(Bitcoin_price) ~ ((Date + Transaction_volume + USD_Yuan_exchange_rate + 
##     USD_Euro_Exchange_Rate + DJIA_index + NYSE_index + brent_oil_price + 
##     WTI_oil_price + Gold_price + president_code + interest_rate + 
##     year + Date_posix) * year)
store_result(log_groupby_year_back_aic, "log_groupby_year_back_aic", log = TRUE)

Removing influencial points and alternative models

We will use Cook’s Distance to calculate influential points.

cooksdistance = cooks.distance(log_groupby_year_back_bic)
log_groupby_year_back_bic_fix = lm(log_groupby_year_back_bic,
                   data = data,
                   subset = cooksdistance < 4 / length(cooksdistance))
plot_prediction(log_groupby_year_back_bic_fix, data, trn_idx, log = TRUE)

store_result(log_groupby_year_back_bic_fix, "log_groupby_year_back_bic_fix", log = TRUE)
cooksdistance = cooks.distance(log_groupby_year_back_aic)
log_groupby_year_back_aic_fix = lm(log_groupby_year_back_aic,
                   data = data,
                   subset = cooksdistance < 4 / length(cooksdistance))
plot_prediction(log_groupby_year_back_aic_fix, data, trn_idx, log = TRUE)

store_result(log_groupby_year_back_aic_fix, "log_groupby_year_back_aic_fix", log = TRUE)
cooksdistance = cooks.distance(log_full_int_mod_back_bic)
log_int_mod_back_bic_fix = lm(log_full_int_mod_back_bic,
                   data = data,
                   subset = cooksdistance < 4 / length(cooksdistance))
plots_model(log_int_mod_back_bic_fix)

plot_prediction(log_int_mod_back_bic_fix, data, trn_idx, log = TRUE)

store_result(log_int_mod_back_bic_fix, "log_int_mod_back_bic_fix", log = TRUE)
cooksdistance = cooks.distance(log_full_int_mod_back_aic)
log_int_mod_back_aic_fix = lm(log_full_int_mod_back_aic,
                   data = data,
                   subset = cooksdistance < 4 / length(cooksdistance))
plots_model(log_int_mod_back_aic_fix)

plot_prediction(log_int_mod_back_aic_fix, data, trn_idx, log = TRUE)

store_result(log_int_mod_back_aic_fix, "log_int_mod_back_aic_fix", log = TRUE)
model_list <- data.frame("model names"=model_names, "num of parameters" = model_params_num, "model_adjr2" = model_adjr2, "model_rmse"=model_rmse, "test_rmse"=test_rmse)

knitr::kable(model_list, digits = getOption("digits"), row.names = NA,
  col.names = colnames(model_list), align = "c", caption = NULL, label = NULL,
  format.args = list(), escape = TRUE)
model.names num.of.parameters model_adjr2 model_rmse test_rmse
full_mlr_model 18 0.814725 1700.670 1759.643
exhaustive_search_model 10 0.798801 1782.245 1681.731
log_model 10 0.798801 1782.245 1681.731
log_full_interaction_model 10 0.798801 1782.245 1681.731
log_groupby_year_back_bic 51 0.994384 737.881 783.204
log_groupby_year_back_aic 91 0.994714 703.224 748.628
log_groupby_year_back_bic_fix 51 0.994288 740.657 746.214
log_groupby_year_back_aic_fix 91 0.994847 700.439 686.086
log_int_mod_back_bic_fix 75 0.996128 540.544 583.248
log_int_mod_back_aic_fix 157 0.996685 487.353 484.915

Discussion

The discussion section should contain discussion of your results and should frame your results in the context of the data. How is your final model useful?

Appendix

The appendix section should contain code and analysis that is used, but that would have otherwise cluttered the report or is not directly related to the choice of model. Do not simply dump code in here. Only utilize the appendix to supplement the primary focus of the report. The appendix should also conclude with the names of the group members.

Write in complete sentences and pay attention to grammar, spelling, readability and presentation. If you include a table or chart, make sure you say something about it. If you’re not discussing a result, then it doesn’t belong in your report.

Submit the following three items in a .zip file just as you do in homework assignments.

your selected data, a .Rmd program file, and the project report (.html file) # Reference